Note: There are often multiple ways to answer each question.
caret
package. Also load the ggplot2
and dplyr
packages. Load the Sacramento
dataset with data(Sacramento)
. Plot a histogram of the house prices with 50 bins.library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(Sacramento)
ggplot(Sacramento, aes(x = price)) +
geom_histogram(bins = 50)
price
for each level of type
. Overlay this with a scatterplot with the same x
and y
aesthetics, but define the col
by factor(beds)
and introduce jitter and alpha for the points. Make some observations.There are many more residential homes compared to condos and multi-family homes. Conds seem to be the cheapest on average. Multi-family homes and residential homes seem to be a bit more expensive on average, but there is a lot more variance in residential home prices.
ggplot(Sacramento, aes(x = type, y = price)) +
geom_violin() +
geom_point(aes(col = factor(beds)), position = "jitter", alpha = 0.5)
The code below shows that the \(p\)-value is 0.26, which means that we don’t have enough evidence to reject the null hypothesis of the means being the same.
mf_data <- Sacramento[Sacramento$type == "Multi_Family",]$price
rs_data <- Sacramento[Sacramento$type == "Residential",]$price
t.test(mf_data, rs_data)
##
## Welch Two Sample t-test
##
## data: mf_data and rs_data
## t = -1.1849, df = 12.884, p-value = 0.2574
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -80384.82 23472.20
## sample estimates:
## mean of x mean of y
## 224534.7 252991.0
The \(p\)-value is 0.61, which means that we don’t have enough evidence to reject the null hypothesis of the distributions being the same.
ks.test(mf_data, rs_data)
## Warning in ks.test(mf_data, rs_data): p-value will be approximate in the
## presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: mf_data and rs_data
## D = 0.21292, p-value = 0.607
## alternative hypothesis: two-sided
price
for each level of beds
. Make another boxplot of price
for each level of baths
. Do you see a trend? (Note: You will want to plot factor(beds)
and factor(baths)
, as plotting separate boxplots for different values of a numeric variable doesn’t work.)The more baths or beds, the higher the price of the home.
ggplot(Sacramento, aes(x = factor(beds), y = price)) +
geom_boxplot()
ggplot(Sacramento, aes(x = factor(baths), y = price)) +
geom_boxplot()
The final model is \(price = -3,679 + 22,163 \cdot beds + 88,6572 \cdot baths\). This suggests that adding one bedroom increases the price of the house by about $22k, while adding an additional bathroom increases the price of the house by about $88k.
fit <- lm(price ~ beds + baths, data = Sacramento)
summary(fit)
##
## Call:
## lm(formula = price ~ beds + baths, data = Sacramento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315683 -67015 -20952 49048 540104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3679 13455 -0.273 0.785
## beds 22163 5203 4.260 2.26e-05 ***
## baths 86572 6394 13.540 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106300 on 929 degrees of freedom
## Multiple R-squared: 0.3442, Adjusted R-squared: 0.3428
## F-statistic: 243.8 on 2 and 929 DF, p-value: < 2.2e-16
fit <- lm(price ~ beds * baths, data = Sacramento)
summary(fit)
##
## Call:
## lm(formula = price ~ beds * baths, data = Sacramento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333773 -67872 -21501 49655 540493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28511 33606 0.848 0.396
## beds 12621 10507 1.201 0.230
## baths 70584 16578 4.258 2.27e-05 ***
## beds:baths 4463 4269 1.045 0.296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106300 on 928 degrees of freedom
## Multiple R-squared: 0.3449, Adjusted R-squared: 0.3428
## F-statistic: 162.9 on 3 and 928 DF, p-value: < 2.2e-16
zip
, latitude
and longitude
from the dataset and save the result to df
.df <- Sacramento %>% select(-zip, -latitude, -longitude)
df
that could help to predict price. Fit an additive linear model (with no interaction terms) of price against all other columns in df
. (You will have to google around to find out how to do this!)fit <- lm(price ~ ., data = df)
summary(fit)
##
## Call:
## lm(formula = price ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -245958 -43670 -9165 29768 401858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23330.328 17515.030 1.332 0.183195
## cityAUBURN 104667.280 34897.902 2.999 0.002782 **
## cityCAMERON_PARK 46072.765 26970.453 1.708 0.087935 .
## cityCARMICHAEL 85439.901 20345.056 4.200 2.94e-05 ***
## cityCITRUS_HEIGHTS 9665.403 17466.514 0.553 0.580151
## cityCOOL 93831.945 72605.464 1.292 0.196570
## cityDIAMOND_SPRINGS 30514.087 72616.662 0.420 0.674435
## cityEL_DORADO 49692.662 52187.963 0.952 0.341261
## cityEL_DORADO_HILLS 106484.720 20120.488 5.292 1.52e-07 ***
## cityELK_GROVE 10055.055 14184.769 0.709 0.478595
## cityELVERTA -39524.939 37939.080 -1.042 0.297787
## cityFAIR_OAKS 75469.590 26936.592 2.802 0.005193 **
## cityFOLSOM 128168.817 21454.619 5.974 3.35e-09 ***
## cityFORESTHILL 18242.664 72625.771 0.251 0.801727
## cityGALT 6525.733 19985.347 0.327 0.744103
## cityGARDEN_VALLEY 157438.155 72835.709 2.162 0.030919 *
## cityGOLD_RIVER 82640.091 38107.512 2.169 0.030377 *
## cityGRANITE_BAY 302694.271 43417.787 6.972 6.10e-12 ***
## cityGREENWOOD 6146.228 73102.612 0.084 0.933014
## cityLINCOLN 3935.477 19873.987 0.198 0.843073
## cityLOOMIS 373492.368 52402.185 7.127 2.11e-12 ***
## cityMATHER -52016.806 72700.467 -0.715 0.474491
## cityMEADOW_VISTA 62105.143 72769.260 0.853 0.393638
## cityNORTH_HIGHLANDS -26321.587 20156.623 -1.306 0.191940
## cityORANGEVALE 55858.065 24920.944 2.241 0.025245 *
## cityPENRYN 318538.623 72614.473 4.387 1.29e-05 ***
## cityPLACERVILLE 110008.457 25876.561 4.251 2.35e-05 ***
## cityPOLLOCK_PINES 46047.983 43199.542 1.066 0.286741
## cityRANCHO_CORDOVA -7441.958 18438.617 -0.404 0.686599
## cityRANCHO_MURIETA -30755.686 43311.085 -0.710 0.477821
## cityRIO_LINDA -2378.510 23533.040 -0.101 0.919517
## cityROCKLIN 88921.249 21465.959 4.142 3.76e-05 ***
## cityROSEVILLE 62907.351 16279.920 3.864 0.000120 ***
## citySACRAMENTO -629.545 12999.363 -0.048 0.961385
## cityWALNUT_GROVE 144896.791 72882.393 1.988 0.047108 *
## cityWEST_SACRAMENTO -17779.062 43314.173 -0.410 0.681562
## cityWILTON 185864.842 35372.181 5.255 1.86e-07 ***
## beds -21735.472 4349.106 -4.998 6.99e-07 ***
## baths 6576.108 5427.047 1.212 0.225938
## sqft 131.523 6.359 20.684 < 2e-16 ***
## typeMulti_Family 987.870 23982.425 0.041 0.967153
## typeResidential 43262.632 11147.855 3.881 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71520 on 890 degrees of freedom
## Multiple R-squared: 0.7156, Adjusted R-squared: 0.7025
## F-statistic: 54.63 on 41 and 890 DF, p-value: < 2.2e-16
price
against sqft
and include the linear model fit on the plot (without standard error intervals). Make the title and axis labels informative. Fit the model as well and look at the R-squared value (how well sqft
predicts price
).ggplot(df, aes(x = sqft, y = price)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Price vs. Square Footage", x = "Sq. Ft.", y = "Price")
fit <- lm(price ~ sqft, data = df)
summary(fit)
##
## Call:
## lm(formula = price ~ sqft, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231889 -54717 -11822 38993 600141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13859.393 6948.714 1.995 0.0464 *
## sqft 138.546 3.796 36.495 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84130 on 930 degrees of freedom
## Multiple R-squared: 0.5888, Adjusted R-squared: 0.5884
## F-statistic: 1332 on 1 and 930 DF, p-value: < 2.2e-16
It is worth noting that the model with sqft
alone has \(R^2 = 0.59\), while the model with all the variables thrown into it had a modestly better \(R^2 = 0.72\). This suggests that sqft
is the single driving factor of price.
city_df <- Sacramento %>%
group_by(city) %>%
summarize(houses = n(), price = median(price),
lat = mean(latitude), long = mean(longitude))
ggplot(city_df, aes(x = long, y = lat)) +
geom_point(aes(col = price, size = houses)) +
scale_color_distiller(palette = "YlOrRd")
It would be even more informative if we could have the base map of Sacramento! Unfortunately, the main package used to do this (ggmap
) seems a difficult to use at the moment. We make do with the county map from the maps
package.
library(ggrepel)
library(maps)
county_df <- map_data("county") %>%
filter(region == "california")
ggplot(mapping = aes(x = long, y = lat)) +
geom_polygon(data = county_df, mapping = aes(group = group),
fill = "black", col = "white") +
geom_point(data = city_df, aes(col = price, size = houses)) +
geom_text_repel(data = city_df, aes(label = city, col=price), size = 3) +
scale_color_distiller(palette = "Spectral") +
coord_quickmap(xlim = c(-121.6, -120.3), ylim = c(38.2, 39.1))